[pull] master from ray-project:master #56

pull · 2024-10-29T04:32:32Z

See Commits and Changes for more details.

Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

…ng information about individual connector pieces to the `ctor_args_and_kwargs` file). (#48213)

…ingzoo 1.24.3 supersuit 3.9.3). (#45328)

…mum_gcs_destroyed_actor_cached_count (#48229) Signed-off-by: dayshah <[email protected]>

…e it is iterating with each iterable value. (#48270) Signed-off-by: win5923 <[email protected]>

/salute Signed-off-by: Edward Oakes <[email protected]>

…t level" (#48292) Reverts #48223 Broke #48289 so revert to unblock the release.

## Why are these changes needed? Refactor handle options. - Change `_options` and `copy_and_update` to use `**kwargs` instead of listing out the entire set of handle options, so that in the future adding new options only requires code change in two locations: (1) the public facing `options()` API, and `_HandleOptions` itself. - Separate handle options into "dynamic" and "init" options. Dynamic options can be modified in a lightweight manner with `.options()`, while init options can only be modified once, before a handle is initialized, with `.init()`. - Initializing a handle is equivalent to instantiating the lazily-created router. - `.options()` and `.remote()` implicitly initialize a handle. --------- Signed-off-by: Cindy Zhang <[email protected]>

Reverts #48293

….4, pettingzoo 1.24.3 supersuit 3.9.3)." (#48297) Reverts #45328

## Why are these changes needed? Modifies the `route` attribute used for logs and metrics to limit its cardinality (see linked issue). After this change, the behavior will be: - If the user is not using FastAPI, the `route` will always be the `route_prefix`. - If the user is using FastAPI, the `route` will be the `route_prefix` + the matched route string. Take the following example: ``` @serve.deployment @serve.ingress(fastapi_app) class A: @fastapi_app.get("/{user_id}") def dynamic_subpath(self, user_id: str) -> str: return ... serve.run(A.bind(), route_prefix="/prefix") ``` Previously, this could have infinite cardinality in the metrics tags (a unique tag value for each user). With this change, the tag would instead be: `"/prefix/{user_id}"`. The "not found" response has also been modified to *not* include a `route` tag. TODO: - [x] Add unit tests for `get_asgi_route_name` - [x] Add integration tests to `test_metrics.py` - [x] Add tests for the replica request context ## Related issue number Closes #47999 ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Edward Oakes <[email protected]>

…try` (#48234) Signed-off-by: Alexey Kudinkin <[email protected]>

…ING_OR_CONCURRENCY TaskStatus (#48242) Signed-off-by: Jiajun Yao <[email protected]>

…eployment level" (#48292)" (#48294) This reverts commit 703194a. The original commit caused `test_deploy_2.py` to become flaky. It appears this was a timing issue due to the 10s timeout set on the test. I've removed this, as it looks like a very old test case that is no longer relevant. --------- Signed-off-by: Edward Oakes <[email protected]>

… work on head node (#48103) This PR updates the compute configs for benchmark release tests to not schedule workers onto the head node. This reflects the best practice not to schedule heavy work on the head node for cluster stability. --------- Signed-off-by: Justin Yu <[email protected]>

## Why are these changes needed? depends on #48232 Pull out some router init parameters and router init logic into `create_router`. --------- Signed-off-by: Cindy Zhang <[email protected]>

## Why are these changes needed? Move `test_deployment_scheduler.py` from small to medium. Signed-off-by: Cindy Zhang <[email protected]>

…plify. (#47972)

If args is not a list, an exception will be thrown in the above if statement.

…ate_failed_env_runners=False` to `True`). (#48286)

## Why are these changes needed? Implement full recursive cancellation at Serve layer. Currently, at the time that a parent request is cancelled, if: - The child request has not been assigned to a replica, _meaning_ the replica scheduler in the router is still processing that request - AND the parent request is not directly blocking and waiting on the result of the child request Then the child request will not be correctly cancelled. Note that if the second bullet point were actually flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, then an `asyncio.CancelledError` will be sent directly to the replica scheduler and the child request will get successfully cancelled. However obviously this cannot be guaranteed to be true. This PR adds a cache for asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, if there are any "pending assignment requests" in the cache, they are cancelled. Signed-off-by: Cindy Zhang <[email protected]>

Stacked on: #48223 Reimplements the `.bind()` codepath to avoid using the Ray DAG codepath which adds a lot of complexity for little benefit. The DAG traversal code is much simpler, around 100 lines of self-contained code in `build_app.py`. I've also added unit tests for it. There should be no behavior change here. --------- Signed-off-by: Edward Oakes <[email protected]>

…schedule work on head node" (#48321) Reverts #48103

…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)

Previously the error message was injected directly as the HTTP status message Signed-off-by: Richo Healey <[email protected]>

## Why are these changes needed? fix router.py Signed-off-by: Cindy Zhang <[email protected]>

…g compilation (#47757) Leaf nodes are nodes that are not output nodes and have no downstream nodes. If a leaf node raises an exception, it will not be propagated to the driver. Therefore, this PR raises an exception if a leaf node is found during compilation. Another solution: implicitly add leaf node to MultiOutputNode Currently, the function execute can return multiple CompiledDAGRefs. The UX we want to provide is to implicitly add leaf nodes to the MultiOutputNode but not return the references of the leaf nodes. For example, a MultiOutputNode is containing 3 DAG nodes (2 normal DAG nodes + 1 leaf node). x, y = compiled_dag.execute(input_vals) # We don't return the ref for the leaf node. However, the ref for leaf node will be GC(ed) in execute, and CompiledDAGRef’s del will call get if it was never called which makes execute to be a sync instead of an async operation which is not acceptable.

…#48358) Some tests are failing when we duplicate the original release tests. Marking them as unstable to be non-blocking. Will mark as stable once they are truly stable. Signed-off-by: Rui Qiao <[email protected]>

See #48228 (comment). Signed-off-by: Balaji Veeramani <[email protected]>

GcsServer manages several IO Contexts and their threads, and hard codes which class uses which IO Context. This is not flexible in that we can't easily declare another IO Context and adapt some classes without a pervasive code change. Introduces an IoContextProvider that manages IO Contexts based on a static, compile-time Policy, whose author can determine class-IO Context mapping. This way we can change the mappings in a central place without big chunk of changes everywhere. Signed-off-by: Ruiyang Wang <[email protected]>

To make the visualization more concise and easier to read: Show nodes in concise form and show actor information in the subgraph instead.

…uf (#48593) Signed-off-by: dentiny <[email protected]>

Signed-off-by: hjiang <[email protected]>

…U loader thread, enable local learner w/ GPU. (#48314)

and remove linux wheel testing logic. linux is not using this script anymore. Signed-off-by: Lonnie Liu <[email protected]>

… Consistent (#48088) Signed-off-by: Mengjin Yan <[email protected]>

The current graph shows memory usage of each node along side the MAX memory across the cluster. For oom debugging, we care more about memory utilization (percentage) on each node.

…ing tasks (#48528) Rename "Active tasks by name" to "Requested Live tasks". Make it clear that they do not only refer to running tasks but any tasks that has been requested to be scheduled. Add a graph for "Running tasks by name" that only includes "Running_*" status Rename "Active actors by name"

…type of container as a DAGNode arg (#48302) Based on #48045, we have decided to treat a case where a DAGNode is inside any type of container as a DAGNode argument as an error and throw an exception.

Fixes #48630 Signed-off-by: Balaji Veeramani <[email protected]>

In a recent investigation, we saw `Broken pipe` issue when a node is being preempted and going through the shutdown process. This is issue is due to a race condition: * The node is being preempted and is being gracefully shut down. * At the same time, remote hosts receives the notifications that the node state changes and send `FreeObjects` requests to the node * When handling the `FreeObjects` request, the Raylet will call `Delete` on the local Object store. * However, when the `Delete` happens, the object store has already been stopped and thus caused the Broken Pipe issue. The fix in the code is to move the logic to shutdown GRPC server before stopping the object store. In this sense, we can make sure when stopping the Object Store, there will not be on-going or future GRPC requests. Signed-off-by: Mengjin Yan <[email protected]>

@pcmoritz

Follow up to comment in #48609 that came in after the PR merged. cc: @pcmoritz Signed-off-by: angelinalg <[email protected]> Signed-off-by: angelinalg <[email protected]>

Two issues: 1. When the execution plan caches the dataset schema, it might call `unify_schema` to produce a single schema from all of the bundles' schemas. The issue is that this function expects all of the input schemas to be of Arrow type, but we only check if at least one schema is of Arrow type before calling the function. 2. `to_pandas` iterates over blocks and adds them to `DelegatingBlockBuilder`. The issue is that `DelegatingBlockBuilder` expects all input blocks to be of the same type. --------- Signed-off-by: Balaji Veeramani <[email protected]>

This PR adds docs to the `dev-workflow.md` complementing the changes in this PR: #48477 --------- Signed-off-by: Promptless Bot <[email protected]> Signed-off-by: Frances Liu <[email protected]> Signed-off-by: frances720 <[email protected]> Co-authored-by: promptless[bot] <179508745+promptless[bot]@users.noreply.github.com> Co-authored-by: angelinalg <[email protected]>

## Why are these changes needed? Add Project operator to select_columns. ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Richard Liaw <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]>

This PR integrates `uv` into existing runtime env system, which uses `uv` to setup environment for better performance. TODO list: - [ ] Add `uv` to public documentation --------- Signed-off-by: dentiny <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>

…with custom DNS for Kubernetes API. (#48541) Signed-off-by: ltbringer <[email protected]>

and fix windows build Signed-off-by: Lonnie Liu <[email protected]>

In #47958 , visualization unit tests were introduced but would not run if CI environment does not have graphvis installed (which was the case). This PR adds the dependency in test-requirements.txt and always run the visualization tests. It also moves visualization tests to a separate file since test_accelerated_dag.py is too large.

Signed-off-by: dentiny <[email protected]>

## Why are these changes needed? Refactor resolve request arg func. Separate into: 1. actual type checking + argument resolution (injected into the Router as a function) 2. the logic for executing that resolution and replacing args/kwargs (kept as part of Router) --------- Signed-off-by: Cindy Zhang <[email protected]>

Signed-off-by: hjiang <[email protected]>

Now, when a Ray job is submitted, JobHead starts a Ray actor namely `JobSupervisor` which fork-and-exec a subprocess with the entrypoint. It bears all environment variables from the JobSupervisor, with some additional ones via `_get_driver_env_vars`. However we also want to *remove* some env vars, specifically the Ray internal ones including `RAY_JOB_ID` and `RAY_RAYLET_ID` because they are meant for the JobSupervisor, not for the job we submit. For example, in the entrypoint `$RAY_JOB_ID` can be set to `01000000` because that's the job id of the JobSupervisor, while the entrypoint really is for job 08000000. This did not show up as a bug because these environment variables are not read by the driver core workers. However we find a user pattern that they collect all env vars in the driver process and pass it to tasks/actors via `env_vars`. If they do so, the started workers get confused by the stale job ids and crash. Removes those env vars on JobSupervisor for-and-exec to hide them from the user entrypoint. Next steps: we should check for internal flags in `env_vars` and raise Exceptions for internal flags; and/or we can have an option to auto-remove them for the users. something like: ```py @ray.remote(runtime_env={ "env_vars":os.environ, # can be "raise", "allow_debugonly", "remove", default to "raise" "env_vars_config":{"internal_flags": "remove"}, }) ``` --------- Signed-off-by: Ruiyang Wang <[email protected]>

Signed-off-by: dayshah <[email protected]>

#47752 introduced a minor, silent issue in test_map. I stumbled across it because a linter I'm using complained.

## Why are these changes needed? Match for the more specific error that is raised, in this case `TypeError`, instead of matching for generic `RayTaskError`. --------- Signed-off-by: Cindy Zhang <[email protected]>

simonsays1980 and others added 30 commits October 27, 2024 21:18

[RLlib] Fix ConnectorPipelineV2 restoring from checkpoint (by writi…

6878aa1

…ng information about individual connector pieces to the `ctor_args_and_kwargs` file). (#48213)

[RLlib] Upgrade to gymnasium 1.0.0 (ale_py 0.10.1, mujoco 3.2.4, pett…

bfd0d95

…ingzoo 1.24.3 supersuit 3.9.3). (#45328)

[core][dashboard] Reworking dashboard_max_actors_to_cache to RAY_maxi…

ad2925f

…mum_gcs_destroyed_actor_cached_count (#48229) Signed-off-by: dayshah <[email protected]>

[CI] fix: flake8 rule B020: Found for loop that reassigns the iterabl…

1484f88

…e it is iterating with each iterable value. (#48270) Signed-off-by: win5923 <[email protected]>

[serve] Update codeowners (#48291)

159bd99

/salute Signed-off-by: Edward Oakes <[email protected]>

Revert "[serve] Remove ability to specify route_prefix at deploymen…

703194a

…t level" (#48292) Reverts #48223 Broke #48289 so revert to unblock the release.

Fix broken requirements_boyd_3.9 (#48293)

6309e4b

Revert "Fix broken requirements_boyd_3.9" (#48296)

bdd6ad0

Reverts #48293

Revert "[RLlib] Upgrade to gymnasium 1.0.0 (ale_py 0.10.1, mujoco 3.2…

d782b84

….4, pettingzoo 1.24.3 supersuit 3.9.3)." (#48297) Reverts #45328

[Observability] Added object refs Task is dependent on to `TaskInfoEn…

cacb54c

…try` (#48234) Signed-off-by: Alexey Kudinkin <[email protected]>

[Core] Add PENDING_ACTOR_TASK_ARGS_FETCH and PENDING_ACTOR_TASK_ORDER…

62525c0

…ING_OR_CONCURRENCY TaskStatus (#48242) Signed-off-by: Jiajun Yao <[email protected]>

[serve] refactor router (#48252)

b71f6b7

## Why are these changes needed? depends on #48232 Pull out some router init parameters and router init logic into `create_router`. --------- Signed-off-by: Cindy Zhang <[email protected]>

[serve] move deployment scheduler test (#48303)

a97c23c

## Why are these changes needed? Move `test_deployment_scheduler.py` from small to medium. Signed-off-by: Cindy Zhang <[email protected]>

[RLlib] AutoregressiveActionsRLM overhaul to fix flaky test and sim…

a24bf07

…plify. (#47972)

[core][compiled graph] remove dead code (#48282)

43c3663

If args is not a list, an exception will be thrown in the above if statement.

[RLlib] Change config.fault_tolerance default behavior (from `recre…

44237c6

…ate_failed_env_runners=False` to `True`). (#48286)

Revert "[ci][tune][train] Update release test compute configs to not …

d80b22f

…schedule work on head node" (#48321) Reverts #48103

Revert "Revert "[RLlib] Upgrade to gymnasium 1.0.0 (ale_py 0.10.1, mu…

918bbd1

…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)

Fix response injection (#47810)

2e1b0f6

Previously the error message was injected directly as the HTTP status message Signed-off-by: Richo Healey <[email protected]>

Fix lint (#48322)

5e252d7

## Why are these changes needed? fix router.py Signed-off-by: Cindy Zhang <[email protected]>

[Data] Ensure all tests use shared cluster in test_csv (#48352)

40ad2da

See #48228 (comment). Signed-off-by: Balaji Veeramani <[email protected]>

ruisearch42 and others added 29 commits November 6, 2024 19:53

[core][compiled graphs] Refine schedule visualization (#48594)

3b2e13b

To make the visualization more concise and easier to read: Show nodes in concise form and show actor information in the subgraph instead.

[core] Minor improvement for worker pool, and add comments for protob…

0da142a

…uf (#48593) Signed-off-by: dentiny <[email protected]>

[core] Add source location to ray status (#48512)

9581896

Signed-off-by: hjiang <[email protected]>

[RLlib] Fix IMPALA/APPO learning behavior: Fix EnvRunner sync bug, GP…

359f40a

…U loader thread, enable local learner w/ GPU. (#48314)

[ci] fix osx wheel matching logic (#48615)

d3e4a51

and remove linux wheel testing logic. linux is not using this script anymore. Signed-off-by: Lonnie Liu <[email protected]>

[Core] Make Placement Group Wildcard and Indexed Resource Assignments…

8deca45

… Consistent (#48088) Signed-off-by: Mengjin Yan <[email protected]>

[Dashboard] Add memory graphs optimized for OOM debugging (#48530)

4ecb0af

The current graph shows memory usage of each node along side the MAX memory across the cluster. For oom debugging, we care more about memory utilization (percentage) on each node.

[core][compiled-graph] Throw an exception when DAGNode is inside any …

fafe308

…type of container as a DAGNode arg (#48302) Based on #48045, we have decided to treat a case where a DAGNode is inside any type of container as a DAGNode argument as an error and throw an exception.

[Data] Allow user to specify schema for write_parquet (#48631)

a20c7bc

Fixes #48630 Signed-off-by: Balaji Veeramani <[email protected]>

[doc][core] change ml app to ml workload, and style fix (#48613)

b172af5

Follow up to comment in #48609 that came in after the PR merged. cc: @pcmoritz Signed-off-by: angelinalg <[email protected]> Signed-off-by: angelinalg <[email protected]>

[Core] Rename ClientFactoryFn to CoreWorkerClientFactoryFn (#48576)

06de092

Signed-off-by: Jiajun Yao <[email protected]>

[Core]: Fix ConnectionError on Autoscaler CR lookups in K8s clusters …

fdd4757

…with custom DNS for Kubernetes API. (#48541) Signed-off-by: ltbringer <[email protected]>

[core] fix import path (#48644)

cae5910

and fix windows build Signed-off-by: Lonnie Liu <[email protected]>

[core] [4/N] Package manager uv supports requirement file (#48619)

871d4d0

Signed-off-by: dentiny <[email protected]>

[core] Fix function naming and error propagation (#48639)

bb0adba

Signed-off-by: dentiny <[email protected]>

[core] Fix vector assignment (#48648)

0f74ac8

Signed-off-by: dentiny <[email protected]>

[core] [5/N] Check uv existence before installation (#48632)

ecbc176

Signed-off-by: hjiang <[email protected]>

[core] Minor cpp changes around core worker (#48262)

e2dfcc8

Signed-off-by: dayshah <[email protected]>

[Data] Fix usage of "id" in test_map (#48655)

d4c25e9

#47752 introduced a minor, silent issue in test_map. I stumbled across it because a linter I'm using complained.

[serve] small test change (#48665)

9adfd9f

## Why are these changes needed? Match for the more specific error that is raised, in this case `TypeError`, instead of matching for generic `RayTaskError`. --------- Signed-off-by: Cindy Zhang <[email protected]>

Superskyyy merged commit 1970e6a into Superskyyy:master Nov 9, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] master from ray-project:master #56

[pull] master from ray-project:master #56

pull bot commented Oct 29, 2024

[pull] master from ray-project:master #56

[pull] master from ray-project:master #56

Conversation

pull bot commented Oct 29, 2024